Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding
This paper presents the use of non-autoregressive (NAR) approaches for joint
automatic speech recognition (ASR) and spoken language understanding (SLU)
tasks. The proposed NAR systems employ a Conformer encoder that applies
connectionist temporal classification (CTC) to transcribe the speech utterance
into raw ASR hypotheses, which are further refined with a bidirectional encoder
representations from Transformers (BERT)-like decoder. In the meantime, the
intent and slot labels of the utterance are predicted simultaneously using the
same decoder. Both Mask-CTC and self-conditioned CTC (SC-CTC) approaches are
explored for this study. Experiments conducted on the SLURP dataset show that
the proposed SC-Mask-CTC NAR system achieves 3.7% and 3.2% absolute gains in
SLU metrics and a competitive level of ASR accuracy when compared to a
Conformer-Transformer based autoregressive (AR) model. Additionally, the NAR
systems achieve 6x faster decoding speed than the AR baseline.
Comment: 8 pages, 1 figure, accepted at IEEE SLT202
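The Mask-CTC refinement idea can be illustrated with a toy sketch: low-confidence tokens from the CTC first pass are replaced by a mask symbol and then iteratively re-predicted by a bidirectional decoder. Everything here is illustrative (the decoder is a stub, and the confidence threshold and fill schedule are assumptions, not the paper's settings):

```python
# Toy sketch of Mask-CTC-style refinement (illustrative, not the paper's code).
MASK = "<mask>"

def mask_low_confidence(tokens, confidences, threshold=0.9):
    """Replace CTC hypothesis tokens whose confidence falls below threshold."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]

def refine(tokens, predict_fn, iterations=2):
    """Iteratively fill masked positions using a decoder stand-in.
    Mask-CTC fills the easiest positions first; this stub just fills
    a subset of masked slots per iteration."""
    tokens = list(tokens)
    for _ in range(iterations):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        for i in masked[: max(1, len(masked) // 2)]:
            tokens[i] = predict_fn(tokens, i)
    return tokens

# Stub decoder that always predicts a fixed token, purely for illustration:
hyp = mask_low_confidence(["turn", "on", "the", "ligt"], [0.99, 0.98, 0.95, 0.4])
print(hyp)                                       # ['turn', 'on', 'the', '<mask>']
print(refine(hyp, lambda toks, i: "light"))      # ['turn', 'on', 'the', 'light']
```

In the actual system the stubbed `predict_fn` would be the BERT-like decoder conditioning on the whole (partially masked) hypothesis, which is what makes the refinement non-autoregressive.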
On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
This paper introduces a new method for multi-channel time domain speech
separation in reverberant environments. A fully-convolutional neural network
structure has been used to directly separate speech from multiple microphone
recordings, with no need for conventional spatial feature extraction. To reduce
the influence of reverberation on spatial feature extraction, a dereverberation
pre-processing method has been applied to further improve the separation
performance. A spatialized version of wsj0-2mix dataset has been simulated to
evaluate the proposed system. Both source separation and speech recognition
performance of the separated signals have been evaluated objectively.
Experiments show that the proposed fully-convolutional network improves the
source separation metric and the word error rate (WER) by more than 13% and 50%
relative, respectively, over a reference system with conventional features.
Applying dereverberation as pre-processing to the proposed system can further
reduce the WER by 29% relative using an acoustic model trained on clean and
reverberated data.
Comment: Presented at IEEE ICASSP 202
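The "relative" improvements quoted above (13%, 50%, 29%) follow the standard relative-reduction formula, which is worth making explicit since absolute and relative WER gains are easily confused. The WER values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent: how much 'improved' lowers 'baseline'.
    E.g. WER going from 40% to 20% is a 50% *relative* reduction,
    but only a 20-point *absolute* reduction."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical WERs, not results from the paper:
print(relative_reduction(40.0, 20.0))   # 50.0 -> "50% relative" WER reduction
print(relative_reduction(40.0, 28.4))   # 29.0 -> "29% relative" WER reduction
```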
On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training
In this paper, we explore an improved framework to train a monoaural neural
enhancement model for robust speech recognition. The designed training
framework extends the existing mixture invariant training criterion to exploit
both unpaired clean speech and real noisy data. The unpaired clean speech is
found to be crucial for improving the quality of speech separated from real
noisy recordings. The proposed method also remixes processed and unprocessed
signals to alleviate processing artifacts. Experiments on the single-channel
CHiME-3 real test sets show that the proposed method significantly improves
speech recognition performance over enhancement systems trained either on
mismatched simulated data in a supervised fashion or on matched real data in
an unsupervised fashion. The proposed system achieves between 16% and 39%
relative WER reduction over the unprocessed signal using end-to-end and hybrid
acoustic models, without retraining on distorted data.
Comment: Accepted to INTERSPEECH 202
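The mixture invariant training (MixIT) criterion that this framework extends can be sketched in a few lines: the model separates a mixture of mixtures into several estimated sources, and the loss is computed under the best assignment of those sources back to the two input mixtures. This is a minimal toy version (exhaustive assignment search over tiny signals), not the paper's training code, and it omits the unpaired-clean-speech and remixing extensions the paper adds:

```python
import itertools
import numpy as np

def best_mixit_assignment(est_sources, mix1, mix2):
    """Toy MixIT step: try every binary assignment of estimated sources to the
    two input mixtures and return the lowest-MSE loss and its assignment.
    In training, gradients would flow through this best-assignment loss."""
    best_loss, best_mask = None, None
    for mask in itertools.product([0, 1], repeat=len(est_sources)):
        s1 = sum(s for s, a in zip(est_sources, mask) if a == 0)
        s2 = sum(s for s, a in zip(est_sources, mask) if a == 1)
        loss = float(np.mean((s1 - mix1) ** 2) + np.mean((s2 - mix2) ** 2))
        if best_loss is None or loss < best_loss:
            best_loss, best_mask = loss, mask
    return best_loss, best_mask

rng = np.random.default_rng(0)
a, b = rng.normal(size=16), rng.normal(size=16)   # toy "separated sources"
loss, mask = best_mixit_assignment([a, b], mix1=a, mix2=b)
print(loss, mask)   # ideal separation -> loss 0.0 under assignment (0, 1)
```

The key property is that no clean source references are needed: supervision comes entirely from how the estimated sources re-sum to the observed mixtures.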
Learning Noise Invariant Features Through Transfer Learning for Robust End-to-End Speech Recognition
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
We introduce a monaural neural speaker embeddings extractor that computes an
embedding for each speaker present in a speech mixture. To allow for supervised
training, a teacher-student approach is employed: the teacher computes the
target embeddings from each speaker's utterance before the utterances are added
to form the mixture, and the student embedding extractor is then tasked to
reproduce those embeddings from the speech mixture at its input. The system
verifies the presence or absence of a given speaker in a mixture much more
reliably than a conventional speaker embedding extractor, and even achieves
performance comparable to a multi-channel approach that exploits spatial
information for embedding extraction. Further, it is shown that a speaker
embedding computed from one mixture can be used to check for the presence of
that speaker in another mixture.
Comment: Accepted for Interspeech 202
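The teacher-student setup described above can be sketched as follows. Both the teacher pooling and the loss are stand-ins chosen for illustration (in the paper the teacher is a pretrained speaker embedding extractor and the student is a neural network); the point is only the supervision structure: teacher targets come from the pre-mix single-speaker signals, while the student sees only the mixture:

```python
import numpy as np

def teacher_embed(utterance):
    """Stand-in teacher: embedding of a single-speaker utterance.
    (A pretrained speaker embedding extractor in the actual system.)"""
    return utterance.mean(axis=0)   # toy temporal pooling

def student_loss(student_embeds, teacher_embeds):
    """Student training target: reproduce the per-speaker teacher embeddings
    from the mixture input; here a simple MSE over all speakers."""
    return float(np.mean((student_embeds - teacher_embeds) ** 2))

rng = np.random.default_rng(0)
utt_a = rng.normal(size=(100, 8))   # single-speaker "utterances" (frames x dims)
utt_b = rng.normal(size=(100, 8))
mixture = utt_a + utt_b             # the only signal the student ever sees
targets = np.stack([teacher_embed(utt_a), teacher_embed(utt_b)])
# A perfect student would emit exactly the teacher targets:
print(student_loss(targets, targets))   # 0.0
```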
Frame-wise and overlap-robust speaker embeddings for meeting diarization
Using a Teacher-Student training approach, we developed a speaker embedding
extraction system that outputs embeddings at frame rate. Given this high
temporal resolution and the fact that the student produces sensible speaker
embeddings even for segments with speech overlap, the frame-wise embeddings
serve as an appropriate representation of the input speech signal for an
end-to-end neural meeting diarization (EEND) system. Our experiments show that
this representation helps mitigate a well-known problem of EEND systems: the
drop in diarization performance as the number of speakers increases is
significantly reduced. We also introduce block-wise processing to enable
diarization of arbitrarily long meetings.
Comment: ICASSP 202
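Block-wise processing of long recordings amounts to splitting the frame sequence into fixed-length, possibly overlapping blocks and diarizing each block separately (with speaker identities stitched across blocks afterwards). A minimal sketch of the block segmentation, with block length and hop chosen arbitrarily for illustration:

```python
def blocks(num_frames, block_len, hop):
    """Split a long recording into overlapping (start, end) frame blocks so
    that arbitrarily long meetings can be processed block by block.
    The last block is clipped to the recording length."""
    starts = range(0, max(1, num_frames - block_len + hop), hop)
    return [(s, min(s + block_len, num_frames)) for s in starts]

print(blocks(10, block_len=4, hop=2))   # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(blocks(3, block_len=4, hop=2))    # [(0, 3)] - shorter than one block
```

Overlap between consecutive blocks (hop < block_len) is what allows speaker labels to be aligned from one block to the next when the per-block outputs are stitched together.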